MLOps Initialization

In this project, an MLOps platform was implemented with ClearML to manage the complete lifecycle of machine learning models. ClearML was chosen because it provides the features needed for a robust and flexible MLOps environment: experiment tracking, a model registry, model serving, remote training, and hyperparameter optimization. It is also open source.

Planning and selection of the MLOps software #

The first step was to plan the project and select suitable MLOps software. A comparative analysis was carried out between several options, including ClearML and MLflow. A weighted decision matrix led to the choice of ClearML, as it offered all the required features and could be integrated into our existing workflows.
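
The idea behind a weighted decision matrix can be sketched in a few lines of Python. The criteria, weights, scores, and candidate names (`tool_a`, `tool_b`) below are hypothetical placeholders for illustration only; they are not the values used in this project.

```python
# Hypothetical criteria and weights (they should sum to 1.0).
# These are illustrative assumptions, not the project's actual matrix.
WEIGHTS = {
    "experiment_tracking": 0.3,
    "model_serving": 0.2,
    "remote_training": 0.3,
    "open_source": 0.2,
}

# Hypothetical scores on a 1-5 scale for two placeholder candidates.
SCORES = {
    "tool_a": {
        "experiment_tracking": 5,
        "model_serving": 4,
        "remote_training": 5,
        "open_source": 5,
    },
    "tool_b": {
        "experiment_tracking": 5,
        "model_serving": 2,
        "remote_training": 2,
        "open_source": 5,
    },
}


def weighted_score(scores, weights):
    """Return the sum of score * weight over all criteria."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(weights[criterion] * scores[criterion] for criterion in weights)


def rank(candidates, weights):
    """Return (name, total) pairs sorted from best to worst total."""
    totals = [(name, weighted_score(s, weights)) for name, s in candidates.items()]
    return sorted(totals, key=lambda pair: pair[1], reverse=True)


for name, total in rank(SCORES, WEIGHTS):
    print(f"{name}: {total:.2f}")
```

Keeping the weights separate from the scores makes it easy to check how sensitive the ranking is to a change in priorities.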

Installation and deployment #

The ClearML platform was installed on AWS. An orchestrator node was set up as the central control point of the platform. In addition, two GPU worker nodes and one CPU worker node were set up to run compute-intensive training jobs.

All components were implemented as Docker containers, which enabled a flexible and scalable deployment strategy. The GPU worker nodes were used to train the models, while Nvidia Triton was used for optimized model serving.

graph TB
    A[Orchestrator Node] --> B[GPU Worker Node 1]
    A --> C[GPU Worker Node 2]
    A --> F[CPU Worker Node]
    A --> D[Nvidia Triton]
    E[Development PC] --> A

The data scientist developed locally on a PC, where the experiments were prepared and initially tested. As soon as an experiment was ready for training, it was moved to the GPU worker nodes in the AWS cloud. ClearML integrated local development with training on the GPU cluster, which made the workflow considerably more efficient.
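
The hand-off from a local run to a worker node can be sketched with the ClearML SDK's `Task.init` and `Task.execute_remotely` calls. This is a minimal sketch, not the project's actual script: the queue names `gpu` and `cpu`, the helper names, and the placeholder `train` function are assumptions, since the write-up does not say how the queues or scripts were named.

```python
# Hypothetical queue names; the real queue names are not given in the write-up.
GPU_QUEUE = "gpu"
CPU_QUEUE = "cpu"


def pick_queue(needs_gpu: bool) -> str:
    """Map a job's hardware requirement to an (assumed) ClearML queue name."""
    return GPU_QUEUE if needs_gpu else CPU_QUEUE


def train() -> None:
    """Placeholder for the actual training code."""
    print("training...")


def run_experiment(project: str, name: str, needs_gpu: bool, remote: bool) -> None:
    """Register the run with ClearML and optionally hand it to a worker node."""
    # Imported lazily so pick_queue() can be used without ClearML installed.
    from clearml import Task

    task = Task.init(project_name=project, task_name=name)
    if remote:
        # Locally, this enqueues the task and exits the process. When an
        # agent re-runs the script on a worker, the call does nothing and
        # execution continues with train() below.
        task.execute_remotely(queue_name=pick_queue(needs_gpu), exit_process=True)
    train()
```

Calling `run_experiment("demo", "first-run", needs_gpu=True, remote=False)` would run the experiment locally for a quick test; switching to `remote=True` would send the same script to a worker queue. Both calls require a configured ClearML server.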

Example experiments #

Two example experiments were carried out to test the new environment.

Breast cancer detection with Random Forest #

The first example experiment dealt with the detection of breast cancer using a random forest model. The data was prepared and the model was developed locally. The model was then trained on the CPU worker node in the AWS cloud, using ClearML for experiment tracking and hyperparameter optimization.
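
An experiment of this kind might look roughly like the following sketch. It assumes scikit-learn's bundled breast cancer dataset as a stand-in, because the write-up does not name the dataset used. The project name, task name, and parameter values are likewise hypothetical. ClearML tracking is optional here and switched off by default so the function also runs without a server.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_and_evaluate(params: dict, track: bool = False) -> float:
    """Train a random forest and return its accuracy on a held-out split."""
    task = None
    if track:
        # Requires a configured ClearML server; imported lazily on purpose.
        from clearml import Task

        # Hypothetical project and task names.
        task = Task.init(project_name="examples", task_name="breast-cancer-rf")
        # connect() registers the dict as hyperparameters of the task, so
        # they can be overridden when the task is cloned or optimized.
        params = dict(task.connect(params))

    # Stand-in dataset (assumption): scikit-learn's bundled breast cancer data.
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0
    )

    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))

    if task is not None:
        # Report the metric so it appears in the experiment's scalar plots.
        task.get_logger().report_scalar(
            title="accuracy", series="test", value=accuracy, iteration=0
        )
    return accuracy
```

A local test run could be `train_and_evaluate({"n_estimators": 50, "max_depth": 5, "random_state": 0})`. Passing `track=True` would additionally log the parameters and the resulting accuracy to ClearML.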

Facial expression recognition with deep learning #

The second experiment focused on the recognition of facial expressions using deep learning. A deep neural network was used to classify different emotions in facial images. Again, ClearML played a central role by orchestrating the training on the GPU worker nodes and then registering the model in the model registry.

Conclusion #

By implementing ClearML as the MLOps platform, an efficient and scalable environment was created that covers the entire ML lifecycle. The platform proved useful for automating experiments and making them traceable, and it allowed complex models to be developed, trained, and deployed efficiently. With ClearML as the central solution, the workflow from development to production deployment was streamlined.

Activities #

Project planning and organization
Selection of the MLOps platform using a weighted decision matrix
Deployment of the ClearML application on AWS
Configuration of several worker instances (CPU, GPU)
Testing the platform with machine learning projects (Random Forest, Deep Learning)
Presentation and introduction of the platform